(54) Data display method and apparatus for use in text mining



(57) In a text mining technique, if the system only extracts
characteristic words and phrases that frequently cooccur with the
respective components of an analysis axis serving as an analysis
condition, similar words and phrases are extracted for every
component (Figs. 10 to 12). To clearly indicate the existence of
characteristic words and phrases that do not appear as cooccurrence
words and phrases for other components of the analysis axis, it is
desirable to present the distinguishing features of the components
to the user. For this purpose, the frequency of appearance of a
plurality of characteristic words and phrases in documents
satisfying each analysis condition is calculated (steps 701 and
903). As a result, multiple cooccurrence words and phrases and
component-cooccurrence words and phrases are displayed so as to be
distinguishable from each other. The user can therefore
appropriately analyze the contents of a plurality of documents.



[FIG. 1: block diagram. Recoverable labels: CPU 102; FDD; retrieval program; objective document set creation program 110; characteristic words and phrases extraction program; analysis axis setting program; cooccurrence words and phrases acquisition program; system control program 109; multiple cooccurrence words and phrases acquisition program; similar topic extraction program 115; work area; text file; reference numerals 100 to 118.]
Printed by Jouve, 75001 PARIS (FR) 






EP 1 233 349 A2 






Description 

BACKGROUND OF THE INVENTION 

[0001] The present invention relates to a data display method and a
data display apparatus in which various data are acquired, for a set
of specified documents, from a database of documents registered in
advance, and the acquired data are displayed.
[0002] With the recent spread of word processors, personal
computers, and the like, the amount of electronic information
generated by such devices is increasing. Moreover, the amount of
electronic information available via the World Wide Web (WWW),
e-mail, newswires, and the like is rapidly increasing. For firms and
companies, it is important to analyze the contents of such
electronic information in order to use it efficiently.

[0003] In general, most electronic information is described in text,
that is, in the form of statements. Text information, for example,
the contents of a free-answer questionnaire, cannot be easily
analyzed by computers and has therefore been analyzed manually.
However, manual information analysis has the following problems.
(1) The person in charge of the analysis must read all of the
documents. Therefore, when the number of documents increases
greatly, this method is not practical. (2) The analysis is carried
out according to the subjective judgement of the user. Therefore,
the results vary depending on the knowledge and skill of the user.
[0004] Therefore, there is an increasing need for text mining as a
technique to support manual information analysis. U.S. Patent
6,006,223 to Agrawal et al., entitled "Mapping Words, Phrases Using
Sequential-Pattern To Find User Specific Trends In a Text Database"
and issued on December 21, 1999, concretely describes a processing
procedure of text mining. This will be referred to as prior art 1
hereinbelow. In text mining, a search is made through text
information registered in advance to detect new knowledge according
to, for example, the cooccurrence of words and phrases or the
tendency of occurrence of words and phrases contained in the
information to be processed. Specifically, for a set of documents to
be processed, an analysis axis representing points of view for the
analysis is set, and words and phrases representing characteristics
of the set of documents are acquired according to a correspondence
to the constituent components of the analysis axis. In this
expression, "to acquire words and phrases according to a
correspondence to constituent components of the analysis axis"
means, for example, "to acquire words and phrases which cooccur in a
predetermined range with constituent components of the analysis
axis." By referring to the words and phrases, the user can recognize
a tendency of the set of documents. Fig. 2 shows an example of
analysis in which a set of newspaper news items on "0157" is
analyzed using "the month of report or publication of the pertinent
news item" as the analysis axis. That is, the analysis conditions
are expressed as "news item reported in 'July'", "news item reported
in 'August'", and the like. In the analysis using the publication
month as the analysis axis, the words "infection, patient, symptom,
hospitalization, etc." are acquired in association with "July" as a
component of the analysis axis; the words "damage, provision of
meals, hospitalization, group infection, etc." are acquired in
association with "August"; the words "sales amount, minus, foods,
perishable, etc." are acquired in association with "September"; and
so on. By referring to the words, the user can see that the set of
documents contains the topics "Patients infected with '0157
disease-causing bacteria' are hospitalized" in "July", "Group
infection with '0157 bacteria' through provision of meals" in
"August", and "Sales amount of perishable foods and the like lowered
due to influence of 0157" in "September".

[0005] Fig. 3 shows an example of the processing procedure of prior
art 1 in a problem analysis diagram (PAD). In step 300, a set of
documents is specified as the object of the text mining. In the case
of a questionnaire, in which the document database contains
documents collected according to predetermined points of view, the
database is directly specified as the objective document set. In the
case of newspaper items, in which the database contains documents
gathered according to various points of view such as politics,
economy, sports, and the like, a full text search is conducted
according to the analysis purpose of the user to specify a set of
documents. "A full text search" is a technique in which, in a
registration stage, all texts of the documents to be processed are
inputted to a computer system to generate a database and, in a
retrieval stage, all documents containing a character string
specified by the user are retrieved from the database. For example,
U.S. Patent 6,094,647 to Kato et al., entitled "Presearch Type
Document Search Method and Apparatus" and assigned to the present
assignee, describes the full text search in detail. This technique
will be referred to as prior art 2 hereinbelow. In step 301,
characteristic words and phrases, namely, words and phrases which
characterize the contents, are extracted from the set of documents
specified in step 300. The characteristic words and phrases may be
extracted by referring to a dictionary or by using statistical
information. The characteristic words and phrases are not limited to
single words. For example, when the dictionary contains a complex
word including two or more words, for example, "disease-causing
colon bacillus", the characteristic words and phrases extracted in
step 301 may include two or more words. Conversely, the
characteristic words and phrases to be extracted may be limited to
single words. In step 302, an analysis axis is set as points of view
for the analysis. In this example, "date", "age", "sex", and the
like assigned as bibliographical information items of a document are
specified as the analysis axis, or words and phrases specified by
the user are set as constituent components of the analysis axis. For
example, when it is desired to obtain differences in awareness by
age from a questionnaire, the age is set as the analysis axis. In
this situation, values representing ages such as "20" and "30" are
specified as components of the analysis axis. Finally, in step 303,
the processing of step 304 is repeatedly executed for the components
of the analysis axis set in step 302. In step 304, a search is made
through the characteristic words and phrases extracted in step 301
to extract words and phrases strongly related to the component of
the analysis axis, for example, cooccurrence words and phrases which
cooccur with it in a predetermined range. The predetermined range is
specified as, for example, "within one document", "within one
paragraph", "within one sentence", or "within m or n words (m and n
are integers)." In prior art 1, words and phrases are obtained by
establishing correspondence to the components of the analysis axis
to help the user recognize a tendency of the set of documents. As
described above, since the words and phrases characterizing the set
of documents are automatically obtained in this way, the load
imposed on the user can be reduced and the difference in analysis
results between users can be minimized.
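
The loop of steps 303 and 304 can be sketched in Python as follows. This is an illustrative sketch only, not the procedure of prior art 1 itself. It assumes the "within one document" range, treats each document as an already extracted list of characteristic words tagged with its publication month, and ranks words by document frequency. The names `cooccurrence_by_component`, `docs`, and `top_k` are hypothetical, and the sample data is made up.

```python
from collections import Counter, defaultdict


def cooccurrence_by_component(docs, top_k=3):
    """Rank characteristic words per analysis-axis component.

    docs: list of (component, words) pairs, where `component` is the
    analysis-axis value of a document (here its publication month) and
    `words` are its characteristic words (the output of step 301).
    Assumed cooccurrence range: "within one document".
    """
    counts = defaultdict(Counter)
    for component, words in docs:
        # Count each word once per document (document frequency).
        counts[component].update(set(words))
    # Steps 303/304: for every component keep the most frequent words.
    # Ties are broken alphabetically so the result is deterministic.
    return {
        component: [
            word
            for word, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        ]
        for component, c in counts.items()
    }


# Made-up sample data in the spirit of Fig. 2.
docs = [
    ("July", ["infection", "patient", "symptom"]),
    ("July", ["infection", "patient", "hospitalization"]),
    ("August", ["infection", "group", "inspection"]),
    ("August", ["infection", "group", "foods"]),
]
result = cooccurrence_by_component(docs, top_k=2)
```

With this sample data, "infection" ranks highly for both months, which is the situation paragraph [0007] describes as a problem.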

SUMMARY OF THE INVENTION 

[0006] According to prior art 1, the words and phrases
characterizing the set of documents are automatically obtained by
establishing correspondence to the components of the analysis axis.
Therefore, it is possible to reduce the load imposed on the user as
described above, and the variation in analysis results arising from
the knowledge and skill of individual users can be minimized.

[0007] However, prior art 1 has the following problem. As can be
seen from the analysis example of Fig. 4, when the words and phrases
with a high frequency of cooccurrence with each component of the
analysis axis are simply extracted from the set of documents, the
same words and phrases (italicized in Fig. 4), such as
"disease-causing colon bacillus", "food poisoning", "infection", and
"group", are extracted for every component. As a result,
cooccurrence words and phrases which rarely appear for other
components of the analysis axis, such as "patient" and "symptom" for
"July" and "inspection" and "foods" for "August", are overlooked. It
is therefore not possible to appropriately present to the user the
differences in meaning between the components of the analysis axis.

[0008] It is therefore an object of the present invention to provide
a data display method and a data display apparatus with which the
user can suitably analyze the contents of a plurality of documents.
[0009] According to one aspect of the present inven- 
tion, a frequency of appearances of a plurality of words 
and phrases in a document satisfying each analysis 
condition is calculated and the words and phrases are 
displayed according to a result of the calculation. 
[0010] Another object of the present invention is to 
provide a document processing system which supports 
a text mining function to clarify similar points and differ- 
ent points of words and phrases cooccurring with each 
component of an analysis axis so that the user can ap- 
propriately analyze a tendency of a set of the docu- 
ments. 

[0011] To achieve the objects, according to one aspect of the
present invention, there is provided a text mining method including:
a characteristic words and phrases extraction step of collecting,
from a set of documents registered in advance, all or part of the
documents into a set of processing objective documents and
extracting therefrom words and phrases characteristically appearing
therein; a mining scheme creation step of setting definition
information, or a mining scheme, containing specified components; a
cooccurrence words and phrases acquisition step of acquiring, from
the words and phrases extracted in the characteristic words and
phrases extraction step, cooccurrence words and phrases cooccurring
in a predetermined range with each component contained in the mining
scheme; and a multiple cooccurrence words and phrases extraction
step of comparing cooccurrence words and phrases between the
components contained in the mining scheme, acquiring, as multiple
cooccurrence words and phrases, cooccurrence words and phrases
related to many components contained in the mining scheme, and
creating component-cooccurrence words and phrases by removing the
multiple cooccurrence words and phrases from the cooccurrence words
and phrases of the respective components.



[0012] The objects, features and advantages of the present invention
will become more apparent from the following detailed description of
the embodiments of the invention when taken in conjunction with the
accompanying drawings in which:

Fig. 1 is a schematic block diagram showing the structure of an
embodiment according to the present invention;

Fig. 2 is a schematic diagram for explaining prior art 1;

Fig. 3 is a PAD showing the contents of processing of prior art 1;

Fig. 4 is a schematic diagram for explaining a problem of prior
art 1;

Fig. 5 is a diagram exemplifying the contents of processing of
multiple cooccurrence words and phrases extraction of the present
invention;

Fig. 6 is a diagram showing a display format of words and phrases
extracted from a retrieval objective document according to an
embodiment of the present invention;

Fig. 7 is a PAD showing steps to generate a set of multiple
cooccurrence words and phrases and a set of component-cooccurrence
words and phrases according to an embodiment of the present
invention;

Fig. 8 is a PAD showing the contents of processing of a similar
topic extraction process in an embodiment of the present invention;

Fig. 9 is a PAD showing steps to create (a set of)
component-cooccurrence words and phrases in an embodiment of the
present invention;

Fig. 10 is a process diagram showing the contents of processing to
analyze characteristic words and phrases extracted from documents in
an embodiment of the present invention;

Fig. 11 is a diagram showing the contents of multiple cooccurrence
words and phrases acquisition processing in an embodiment of the
present invention; and

Fig. 12 is a diagram showing the contents of multiple cooccurrence
words and phrases removal processing in an embodiment of the present
invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS 

[0013] Prior to explanation of an embodiment of the present
invention, the principle of the present invention will be described
using the document retrieval method. When a text mining execution
indication is inputted, the set of documents that is the object of
the text mining is accessed to extract characteristic words and
phrases characterizing its contents, and, from the extracted
characteristic words and phrases, words and phrases strongly related
to the components of a specified analysis axis are obtained, for
example, cooccurrence words and phrases cooccurring in a
predetermined range. This processing is similar to that of prior
art 1. As a result, cooccurrence words and phrases are obtained for
the respective components of the analysis axis as shown in Fig. 4.
In the present invention, the cooccurrence words and phrases are
then compared between the components of the analysis axis to
acquire, as multiple cooccurrence words and phrases, words and
phrases related to many components. By removing the multiple
cooccurrence words and phrases from the cooccurrence words and
phrases of the respective components, component-cooccurrence words
and phrases are created.
[0014] A concrete example of the processing will be described by
referring to Figs. 5 and 6. First, cooccurrence words and phrases
related to many components of the analysis axis are obtained as
multiple cooccurrence words and phrases. In the example shown in
Fig. 5, "disease-causing colon bacillus", "food poisoning",
"infection", "group", etc. are cooccurrence words and phrases for
most of the components, and are therefore obtained as multiple
cooccurrence words and phrases. In the example of this diagram, the
words and phrases that are cooccurrence words and phrases of many
components are simply taken as multiple cooccurrence words and
phrases; however, weighting may be applied according to the rank
order of the cooccurrence words and phrases and/or the strength of
cooccurrence of the pertinent words and phrases. The strength of
cooccurrence is indicated by a value calculated according to the
number of cooccurrences between the respective component and the
pertinent word/phrase or between other components and the pertinent
word/phrase. For example, a characteristic word/phrase which rarely
cooccurs with other components but cooccurs many times with the
pertinent component has a greater strength of cooccurrence with the
pertinent component. Next, component-cooccurrence words and phrases
(elemental cooccurrence words and phrases) are created by removing
the multiple cooccurrence words and phrases from the cooccurrence
words and phrases of the respective components. In the example shown
in Fig. 5, the multiple cooccurrence words and phrases (italicized
in the diagram) such as "disease-causing colon bacillus", "food
poisoning", "infection", "group", etc. are removed from the
cooccurrence words and phrases of the respective components to
create component-cooccurrence words and phrases. To display the
results of the processing to the user, the multiple cooccurrence
words and phrases may be presented as similar topics of the
components of the analysis axis and the component-cooccurrence words
and phrases as topics of the respective components, for example, as
shown in Fig. 6. In this diagram, each value displayed as importance
indicates a degree of relationship to the components, namely, the
number of components to which the pertinent word/phrase is related.
Moreover, if the restriction on the cooccurrence words and phrases
to be obtained as multiple cooccurrence words and phrases is relaxed
so that cooccurrence words and phrases with a lower degree of
importance are also extracted as multiple cooccurrence words and
phrases, only cooccurrence words and phrases unique to the
respective components remain as component-cooccurrence words and
phrases. Therefore, it is possible to present topics unique to the
respective components.
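
The text gives no formula for the strength of cooccurrence, so the following Python sketch uses a hypothetical score chosen only to reproduce the stated behaviour: the value grows with the number of cooccurrences with the pertinent component and shrinks with the number of cooccurrences with the other components. The function name `cooccurrence_strength`, the ratio itself, and the counts are assumptions for illustration.

```python
def cooccurrence_strength(counts, component, word):
    """Hypothetical strength score; the source gives no formula.

    counts: dict mapping component -> dict mapping word -> number of
    cooccurrences. The score is the count for `component` divided by
    one plus the total count for all other components.
    """
    own = counts.get(component, {}).get(word, 0)
    others = sum(
        c.get(word, 0) for name, c in counts.items() if name != component
    )
    return own / (1 + others)


# Made-up cooccurrence counts for three components.
counts = {
    "July": {"infection": 8, "patient": 6},
    "August": {"infection": 9, "inspection": 5},
    "September": {"infection": 7, "sales amount": 4},
}
strong = cooccurrence_strength(counts, "July", "patient")    # 6 / (1 + 0)
weak = cooccurrence_strength(counts, "July", "infection")    # 8 / (1 + 16)
```

Under this assumed score, "patient" is strongly tied to "July" even though "infection" cooccurs with "July" more often in absolute terms.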
[0015] In the method described above, the cooccurrence words and
phrases are compared between the components of the analysis axis,
and cooccurrence words and phrases related to many components are
obtained as multiple cooccurrence words and phrases. The
component-cooccurrence words and phrases are created by removing the
multiple cooccurrence words and phrases from the cooccurrence words
and phrases of the respective components. As a result, it is
possible to clarify the similarities between the components of the
analysis axis as the multiple cooccurrence words and phrases and the
differences between them as the component-cooccurrence words and
phrases. Therefore, it is possible to provide a document processing
system with which the user can appropriately analyze a tendency of a
set of documents.

[0016] The principle of the present invention will be described by
referring to the PAD shown in Fig. 7. When an indication of text
mining execution is inputted, a set of documents as the object of
the text mining is specified in step 300. In step 301,
characteristic words and phrases characterizing the contents are
extracted from the set of documents specified in step 300. In step
302, an analysis axis is set as points of view for the analysis. In
step 303, the processing of step 304 is repeatedly executed for each
component of the analysis axis set in step 302. In step 304, the
characteristic words and phrases extracted in step 301 are accessed
to obtain therefrom words and phrases strongly related to the
pertinent component of the analysis axis, for example, cooccurrence
words and phrases cooccurring in a predetermined range. The
processing from step 300 to step 304 is similar to that of prior
art 1. As a result, cooccurrence words and phrases corresponding to
the respective components of the analysis axis are obtained as shown
in Fig. 4. Moreover, according to the present invention, the
cooccurrence words and phrases are compared between the components
of the analysis axis, and cooccurrence words and phrases related to
many components are obtained as multiple cooccurrence words and
phrases. Thereafter, component-cooccurrence words and phrases are
created by removing the multiple cooccurrence words and phrases from
the cooccurrence words and phrases of the respective components.
According to the present invention, when an indication of similar
topic extraction is inputted in step 700, the cooccurrence words and
phrases related to many components of the analysis axis are obtained
as multiple cooccurrence words and phrases.
[0017] Description will now be given of an embodi- 
ment of the present invention by referring to the accom- 
panying drawings. 

[0018] Fig. 1 is a block diagram showing the constitution of a
document processing system according to an embodiment of the present
invention. As shown in Fig. 1, the document processing system
includes a display 100, a keyboard 101, a central processing unit
(CPU) 102, a floppy disk drive (FDD) 104, a magnetic disk device
106, a main memory 108, and a bus 103 connecting these constituent
components to each other. The magnetic disk device 106 is a
secondary storage to store a text file 107. Information stored on
the floppy disk 105 is accessed by the floppy disk drive 104. The
floppy disk drive 104 and the magnetic disk device 106 may be
configured to be connected to other devices via, for example, a
communication line not shown in Fig. 1.
[0019] Stored in the main memory 108 are a system control program
109, an objective document set creation program 110, a retrieval
program 111, a characteristic words and phrases extraction program
112, an analysis axis setting program 113, a cooccurrence words and
phrases acquisition program 114, a similar topic extraction program
115, a multiple cooccurrence words and phrases acquisition program
116, and a multiple cooccurrence words and phrases removal program
117. Additionally, a work area 118 is reserved in the main memory
108. These programs may be stored on a computer-readable recording
medium such as the magnetic disk 106 or the floppy disk 105.

[0020] Description will next be given of the processing executed by
the embodiment of the present invention by referring to Fig. 8. When
a text mining execution indication from the keyboard 101, a function
call from another program, or the like is received, the system
control program 109 starts its operation to control the objective
document set creation program 110, the characteristic words and
phrases extraction program 112, the analysis axis setting program
113, the cooccurrence words and phrases acquisition program 114, and
the similar topic extraction program 115.

[0021] In step 800, the system control program 109 initiates the
objective document set creation program 110, which accesses the text
file 107 to create a set of documents as the object of the
processing. When the text file 107 is a document database of
documents collected according to predetermined points of view, for
example, of a questionnaire, the document database may be directly
set as the objective document set. Alternatively, when the text file
107 is a document database of, for example, newspaper documents
gathered according to various points of view such as politics,
economy, sports, and the like, a full text search may be conducted
according to the analysis purpose of the user to specify a set of
documents. When the full text search or the like is used to create
the objective document set, the objective document set creation
program 110 initiates the retrieval program 111 to search the text
file 107 using a specified retrieval condition. The set of documents
thus retrieved is used as the objective document set. The retrieval
program 111 uses an existing retrieval technique like that of prior
art 2. In step 801, the objective document set creation program 110
initiates the characteristic words and phrases extraction program
112 to extract, from the objective document set created in step 800,
characteristic words and phrases characterizing the contents. The
characteristic words and phrases may be extracted by referring to,
for example, a dictionary or by using statistical information.
Furthermore, words/phrases having the same meaning may be collected
using a thesaurus or the like and replaced with one word/phrase. The
characteristic words and phrases to be extracted are not limited to
single words. For example, when the dictionary includes a complex
word including two or more words, the characteristic word/phrase
extracted in this step may include two or more words. Conversely,
the characteristic word/phrase to be extracted may be limited to one
word.
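
As a rough illustration of dictionary-based extraction with thesaurus replacement in step 801, the Python sketch below performs a greedy longest-match lookup so that a complex word of several words is extracted as one term, and then maps synonyms to one representative term. The function `extract_terms`, the parameter `max_len`, the longest-match strategy, and the tiny dictionary and thesaurus are all assumptions for illustration; the text does not specify how the dictionary lookup is performed.

```python
def extract_terms(tokens, dictionary, thesaurus, max_len=3):
    """Greedy longest-match lookup of dictionary terms in a token list.

    dictionary: set of terms, each a tuple of one or more words.
    thesaurus: dict mapping a term to its representative term.
    max_len: longest term length (in words) that is tried.
    """
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so complex words win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in dictionary:
                term = " ".join(candidate)
                # Replace synonyms with one representative word/phrase.
                terms.append(thesaurus.get(term, term))
                i += n
                break
        else:
            i += 1  # token starts no dictionary term; skip it
    return terms


# Made-up dictionary, thesaurus, and sentence.
dictionary = {
    ("disease-causing", "colon", "bacillus"),
    ("infection",),
    ("stomachache",),
    ("abdominal", "pain"),
}
thesaurus = {"abdominal pain": "stomachache"}
tokens = "the disease-causing colon bacillus infection caused abdominal pain".split()
terms = extract_terms(tokens, dictionary, thesaurus)
```

Setting `max_len=1` in this sketch corresponds to limiting the characteristic word/phrase to one word.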

[0022] In step 802, the program 110 initiates the analysis axis
setting program 113 to set an analysis axis as points of view for
the analysis. In this case, "date", "age", "sex", and the like
assigned as bibliographical information of a document are specified
as the analysis axis, or words and phrases specified by the user are
set as components of the analysis axis. For example, to obtain
differences in awareness by age from a questionnaire, the age is set
as the analysis axis. In this situation, values representing ages
such as "20" and "30" are specified as components of the analysis
axis. In step 803, the program 110 initiates the cooccurrence words
and phrases acquisition program 114 to repeatedly execute the
processing of step 804 for the components of the analysis axis set
in step 802.
[0023] In step 804, from the characteristic words and 
phrases extracted in step 801, words and phrases 
strongly related to the components of the analysis axis 
are obtained. For example, when "age", "sex", and the 
like assigned as bibliographic information items are 
specified as components of the analysis axis, charac- 
teristic words and phrases extracted from documents to 
which the pertinent bibliographic information is assigned 
are obtained as words and phrases strongly related to 
the bibliographic information. For example, when "age" 
is set as an analysis axis in the example of the ques- 
tionnaire, characteristic words and phrases extracted 
from a document to which "age is 20" is assigned are 
obtained as words and phrases strongly related to the 
component "20". 

[0024] When a specified word/phrase is set as a com- 
ponent of the analysis axis, cooccurrence words and 
phrases cooccurring with the specified word/phrase are 
acquired, for example, within a predetermined range. 
The predetermined range is specified, for example, 
"within one document", "within one paragraph", "within 
one sentence", or "within m or n words (m and n are 
integers)." The processing from step 800 to step 804 is 
similar to that of prior art 1. In this embodiment, when 
the similar topic extraction indication is received from 
the keyboard 101 or when a function call is received 
from another program in step 805, the similar topic ex- 
traction program 115 is initiated in step 806 to conduct 
similar topic extraction. 

[0025] Fig. 9 shows the processing of the similar topic extraction
by the similar topic extraction program 115. In step 900, for each
characteristic word/phrase obtained in step 801, a degree of
importance is calculated according to the number of components of
the analysis axis with which it cooccurs. In step 901, any
characteristic word/phrase whose degree of importance (step 900)
exceeds a predetermined value is extracted as a multiple
cooccurrence word/phrase. In step 902, the multiple cooccurrence
words and phrases removal program 117 is initiated to repeatedly
execute the processing of step 903 for the components of the
analysis axis. In step 903, component-cooccurrence words and phrases
are created by removing the multiple cooccurrence words and phrases
obtained in step 901 from the cooccurrence words and phrases of the
pertinent component.
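
Steps 900 to 903 can be sketched in Python as follows. The sketch assumes that the cooccurrence words and phrases of each component are already available as sets and that the degree of importance is simply the number of components a word/phrase cooccurs with. The names `extract_similar_topics`, `cooc`, and `threshold`, as well as the sample data, are hypothetical.

```python
def extract_similar_topics(cooc, threshold):
    """Split cooccurrence words into shared and per-component words.

    cooc: dict mapping component -> set of cooccurrence words/phrases.
    threshold: the predetermined value that importance must exceed.
    """
    # Step 900: importance = number of components a word cooccurs with.
    importance = {}
    for words in cooc.values():
        for word in words:
            importance[word] = importance.get(word, 0) + 1
    # Step 901: words whose importance exceeds the threshold.
    multiple = {w for w, k in importance.items() if k > threshold}
    # Steps 902/903: remove them from the words of every component.
    per_component = {c: words - multiple for c, words in cooc.items()}
    return importance, multiple, per_component


# Made-up cooccurrence sets for three components.
cooc = {
    "July": {"infection", "group", "patient", "symptom"},
    "August": {"infection", "group", "inspection", "foods"},
    "September": {"infection", "sales amount", "perishable"},
}
importance, multiple, per_component = extract_similar_topics(cooc, threshold=1)
```

Here `multiple` holds the similar topics and `per_component` the topics of the respective components; lowering `threshold` moves more words into `multiple`.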

[0026] Referring now to Fig. 8, the processing of this embodiment
will be described in detail. In step 800, the system control program
109 initiates the objective document set creation program 110, in
which documents as the object of the processing are selected from
the text file 107 and collected as a document set for the
processing. When the text file 107 is a document database including
documents collected according to points of view determined in
advance, for example, of a questionnaire, the document database may
be set as the objective document set. Alternatively, when the text
file 107 is a document database of, for example, newspaper documents
gathered according to various points of view such as politics,
economy, sports, and the like, a full text search may be conducted
according to the analysis purpose of the user to select documents
and thereby create a set of documents. When the full text search or
the like is used to create the objective document set, the objective
document set creation program 110 initiates the retrieval program
111 to search the text file 107 using a specified retrieval
condition. The set of documents thus retrieved is used as the
objective document set. For the retrieval program 111, an existing
retrieval technique like that of prior art 2 is employed.
[0027] Fig. 10 shows an example of text mining for news items
regarding the "0157 disease-causing bacteria" in a newspaper
database. In the example shown in Fig. 10, a newspaper database is
stored in the text file 107 in advance. By executing the retrieval
program 111, the database is narrowed down to the news items
containing "0157", yielding a processing objective document set
including document 0012, document 0130, document 0293, document
0535, document 0829, etc. If the objective documents are structured
documents, the set may be limited to documents that contain "0157"
in any of their structures.
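
Narrowing a database down to an objective document set can be illustrated with the naive Python sketch below. It is a plain linear scan that stands in for a real full text search and is not the presearch technique of prior art 2. It reads "in any structure" as "in any field of the document", which is an assumption; the function `select_documents`, the field names, and the document texts are made up (only the document numbers 0012 and 0130 appear in the text).

```python
def select_documents(database, query):
    """Return the IDs of documents containing `query` in any field.

    database: dict mapping document ID -> dict of field name -> text.
    A naive linear scan standing in for a full text search engine.
    """
    return [
        doc_id
        for doc_id, fields in database.items()
        if any(query in text for text in fields.values())
    ]


# Made-up structured documents with "title" and "body" fields.
database = {
    "0012": {"title": "0157 outbreak at school", "body": "Pupils fell ill."},
    "0013": {"title": "Election results", "body": "Votes were counted."},
    "0130": {"title": "Food poisoning", "body": "Tests confirmed 0157."},
}
objective_set = select_documents(database, "0157")
```

Restricting the search to one structure would amount to checking a single field, for example only `fields["title"]`, instead of all fields.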
[0028] In step 801, the characteristic words and 
phrases extraction program 112 is initiated to extract, 
from the objective document set created in step BOO, 
characteristic words and phrases characterizing the 
contents. The characteristic words and phrases may be 
extracted by referring to, for example, a dictionary or by 
using statistical information. Furthermore, words/phras- 
es having the same meaning may be collected using a 
thesaurus or the like to be replaced with one word/ 
phrase. U.S. Patent 6,047,299 issued on April 4, 2000 
(Kaijima) proposes an example of the thesaurus such 
as an electronic terminology dictionary used for the sup- 
port of editing and translation of a document. The char- 
acteristic words and phrases to be extracted are not lim- 
ited to words. For example, when a complex word in- 
cluding two or more words is contained in the dictionary,
the characteristic word/phrase extracted in this step
may include two or more words. Conversely, the char- 
acteristic word/phrase to be extracted may be limited to 
one word. In the example of Fig. 10, from the objective 
document set created in step 800, there are extracted
characteristic words and phrases "elementary school, 
group, infection, disease-causing colon bacillus, food 
poisoning, patient, stomachache, bleeding, diarrhea, 
symptom, hospitalization, family, secondary infection, 
supermarket, perishable foods, sales amount, to 
damage, 
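
A dictionary-based variant of the extraction in step 801 might look like the sketch below. It is an assumption-laden illustration rather than the extraction program 112 itself: the dictionary entries, the thesaurus mapping, and the sample sentences are invented, and the count used is simply the number of documents in which a phrase occurs.

```python
from collections import Counter
from typing import Dict, Iterable, List


def extract_characteristic_phrases(
    documents: Iterable[str],
    dictionary: List[str],
    thesaurus: Dict[str, str],
) -> Counter:
    """Count, per phrase, the number of documents that contain it.

    Phrases listed in the thesaurus are replaced with their representative
    phrase before counting, so synonyms are collected under one entry.
    """
    counts: Counter = Counter()
    for text in documents:
        lowered = text.lower()
        # A set ensures each representative phrase is counted once per document.
        found = {thesaurus.get(p, p) for p in dictionary if p in lowered}
        counts.update(found)
    return counts


# Hypothetical dictionary (multi-word entries allowed) and thesaurus.
dictionary = ["food poisoning", "infection", "stomach ache", "stomachache"]
thesaurus = {"stomach ache": "stomachache"}
documents = [
    "Food poisoning and stomach ache reported.",
    "Infection spreads; stomachache is common.",
    "Food poisoning again.",
]

counts = extract_characteristic_phrases(documents, dictionary, thesaurus)
print(counts.most_common())
```

Restricting the dictionary to single words gives the one-word variant mentioned in the text.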

[0029] In step 802, the analysis axis setting program 
113 is initiated to set an analysis axis as points of view 
for the analysis. In this case, "date", "age", "sex", and 
the like assigned as bibliographical information items of
a document are specified as the analysis axis or words 
and phrases specified by the user are set as compo- 
nents of the analysis axis. In the example shown in Fig.
10, "news items published in 'July'", "news items pub-
lished in 'August'", etc. are specified as analysis condi-
tions. In step 803, the cooccurrence words and phrases
acquisition program 114 is initiated to repeatedly exe- 
cute processing of step 804 for the components of the 
analysis axis specified in step 802. 

[0030] In step 804, from the characteristic words and
phrases extracted in step 801, words and phrases
strongly related to the pertinent component of the anal-
ysis axis are obtained. In the example of Fig. 10, a bib-
liographic information item of the newspaper, i.e., the month
in which items are published, is set as the component of
the analysis axis. Therefore, "disease-causing colon ba-
cillus, food poisoning, infection, measures, hygiene, ..."
are extracted as words and phrases strongly related to
the component of the analysis axis, i.e., "July" from the
newspaper items published in "July". In the display
method of the words and phrases, the words and phras-
es may be sorted in a sequence of frequency of appear-
ances thereof in the newspaper items published in "July"
to be displayed as the words and phrases deeply related
to "July". Alternatively, the words and phrases may be
sorted in an ascending sequence of frequency of ap-
pearances in the overall database such that the words
and phrases that appear less frequently in the database
are distinguishably displayed in the starting part of the list.
[0031] That is, the items above mean that the words
and phrases "disease-causing colon bacillus, food poi-
soning, infection, measures, hygiene, ..." frequently ap-
pear in the newspaper items published in "July". Simi-
larly, as the words and phrases deeply related to "Au-
gust", "disease-causing colon bacillus, infection, food
poisoning, measures, group, ..." are obtained from the
newspaper items published in "August". Additionally, as
the words and phrases deeply related to "September",
"disease-causing colon bacillus, food poisoning, meas-
ures, group, infection ..." are obtained from the newspa-
per items published in "September". The processing
from step 800 to step 804 is similar to that of prior art 1.
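
Steps 803 and 804 can be pictured with the following sketch, which groups documents by the component of the analysis axis (here the month of publication) and ranks the characteristic words and phrases by how many documents of that component contain them. The input layout, the function name, and the sample data are assumptions for illustration; ties are broken alphabetically only to keep the output deterministic.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def cooccurrence_by_component(
    items: Iterable[Tuple[str, List[str]]], top_n: int = 10
) -> Dict[str, List[str]]:
    """For each component, list phrases in descending order of document frequency."""
    per_component: Dict[str, Counter] = defaultdict(Counter)
    for component, phrases in items:
        # Count each phrase at most once per document.
        per_component[component].update(set(phrases))

    ranked: Dict[str, List[str]] = {}
    for component, counter in per_component.items():
        ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[component] = [phrase for phrase, _ in ordered[:top_n]]
    return ranked


# Hypothetical input: (month of publication, characteristic phrases in one news item).
items = [
    ("July", ["food poisoning", "infection"]),
    ("July", ["food poisoning", "hygiene"]),
    ("August", ["infection", "group"]),
    ("August", ["infection"]),
]

ranked = cooccurrence_by_component(items)
print(ranked)
```

Sorting instead by ascending frequency in the overall database, as the text also allows, would only require a different sort key.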
[0032] In this embodiment, when the similar topic ex-
traction indication is received from the keyboard 101 or
when a function call is received from another program 
in step 805, the similar topic extraction program 115 is 
initiated in step 806 to conduct similar topic extraction. 
Referring next to Fig. 9, description will be given in detail 
of the similar topic extraction. 

[0033] In step 900, for each characteristic word/
phrase obtained in step 801, the similar topic extraction
program 115 calculates a degree of importance accord-
ing to the number of cooccurrence components of the
analysis axis.
As can be understood from the example of Fig. 10, for the
characteristic word/phrase "disease-causing colon ba-
cillus", cooccurrence takes place for all of the six com-
ponents of the analysis axis. Therefore, the degree of
importance is calculated as, for example, 6/6 x 100 =
100%. Furthermore, for the characteristic word/phrase
"group food poisoning", cooccurrence takes place for 
four components of the analysis axis. Therefore, the de- 
gree of importance is calculated as, for example, 4/6 x 
100 = 67%. In the operation, the characteristic words 
and phrases may be sorted in a descending order of fre-
quency of appearances for each component. A charac-
teristic word/phrase at a predetermined position in this
order, and the characteristic words and phrases that
follow it, are regarded as being of lower importance
even though cooccurrence exists for the respective
components, and hence these characteristic words and
phrases are not taken into consideration when the fre-
quency of appearances is counted. Additionally, for ex-
ample, for characteristic words and phrases of which the
frequency of appearances in the newspaper items
published in "July" is
less than a predetermined value, it may be considered 
that cooccurrence does not exist with "July", and hence 
these characteristic words and phrases are not taken 
into consideration when the frequency of appearances 
is counted. 
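
The arithmetic of step 900 can be reproduced with a short sketch. The formula (number of cooccurring components divided by the number of components, times 100) follows the worked figures above; the function name and the six-month table are assumptions, and the table is filled with made-up entries chosen only so that one phrase cooccurs with six components and the other with four.

```python
from typing import Dict, List


def degree_of_importance(phrase: str, cooccurrence: Dict[str, List[str]]) -> float:
    """Percentage of analysis-axis components whose cooccurrence list holds the phrase."""
    total = len(cooccurrence)
    if total == 0:
        raise ValueError("the analysis axis has no components")
    hits = sum(1 for phrases in cooccurrence.values() if phrase in phrases)
    return 100.0 * hits / total


# Hypothetical cooccurrence lists for six components of the analysis axis.
both = ["disease-causing colon bacillus", "group food poisoning"]
cooccurrence = {
    "July": list(both),
    "August": list(both),
    "September": list(both),
    "October": list(both),
    "November": ["disease-causing colon bacillus"],
    "December": ["disease-causing colon bacillus"],
}

full = degree_of_importance("disease-causing colon bacillus", cooccurrence)
partial = degree_of_importance("group food poisoning", cooccurrence)
print(round(full), round(partial))  # 100 67
```

Low-ranked or low-frequency phrases can be dropped from each list before this count, which corresponds to the optional exclusions described in the text.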

[0034] In step 901, any characteristic word/phrase 
with the degree of importance (step 900) exceeding a 
predetermined value is extracted as a multiple cooccur- 
rence word/phrase. 

[0035] Fig. 11 shows an example of multiple cooccur-
rence words and phrases acquisition. In the example of
Fig. 11, multiple cooccurrence words and phrases are
acquired from the ten highest-ranked characteristic words
and phrases with respect to the frequency of cooccurrence
selected from the characteristic words and phrases 
cooccurring with the respective components of the anal- 
ysis axis shown in Fig. 10. Assume that the threshold 
value is set to "50%". For example, words and phrases 
"disease-causing colon bacillus", "food poisoning", "in- 
fection", and "group" cooccur with all components of the 
analysis axis in a range from "July" to "December". 
Therefore, for these words and phrases, the degree of 
importance is calculated as 100%. This consequently
exceeds the threshold value "50%", and hence these
words and phrases are obtained as multiple cooccur-
rence words and phrases. The word "measure" cooc-
curs with five components excepting "October" among
six components. Therefore, the degree of importance 
thereof is calculated as 83%. This consequently ex- 
ceeds the threshold value "50%", and hence the word 
"measure" is obtained as one of the multiple cooccur- 
rence words and phrases. 
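
The threshold test of step 901 with the 50% example can be sketched as below. The four phrases common to all months and the behaviour of "measure" follow the description of Fig. 11; the function name, the data layout, and the extra phrase "hygiene" (placed in one month only to show a rejected case) are assumptions.

```python
from typing import Dict, List


def multiple_cooccurrence(
    cooccurrence: Dict[str, List[str]], threshold: float
) -> Dict[str, float]:
    """Return phrases whose degree of importance (in percent) exceeds the threshold."""
    total = len(cooccurrence)
    counts: Dict[str, int] = {}
    for phrases in cooccurrence.values():
        for phrase in set(phrases):
            counts[phrase] = counts.get(phrase, 0) + 1
    importance = {p: 100.0 * n / total for p, n in counts.items()}
    return {p: v for p, v in importance.items() if v > threshold}


months = ["July", "August", "September", "October", "November", "December"]
common = ["disease-causing colon bacillus", "food poisoning", "infection", "group"]
cooccurrence = {m: list(common) for m in months}
for m in months:
    if m != "October":
        cooccurrence[m].append("measure")  # five of six components
cooccurrence["July"].append("hygiene")     # assumed single-component phrase

result = multiple_cooccurrence(cooccurrence, threshold=50.0)
print({p: round(v) for p, v in sorted(result.items())})
```

Lowering the threshold admits more phrases as multiple cooccurrence words and phrases, which in turn leaves fewer phrases per component after the removal step.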

[0036] In step 902, the multiple cooccurrence words 
and phrases removal program 117 is initiated to repeat-
edly execute processing of step 903 for the components
of the analysis axis. In step 903, the multiple cooccur-
rence words and phrases obtained in step 901 are re-
moved from the cooccurrence words and phrases of the
pertinent component to thereby create component-
cooccurrence words and phrases. Fig. 12 shows an ex-
ample of the removal of multiple cooccurrence words
and phrases. In the example of Fig. 12, "disease-caus- 
ing colon bacillus", "food poisoning", "infection", 
"group", etc. obtained as multiple cooccurrence words 
and phrases are removed from the cooccurrence words 
and phrases of the respective components to create 
component-cooccurrence words and phrases. 
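
The removal in steps 902 and 903 amounts to a per-component filter, sketched here. The July and August lists are the abbreviated lists quoted earlier for Fig. 10 and the removed set is the one named for Fig. 12; the function name and data layout are assumptions, and the lists are shortened, so the output is illustrative only.

```python
from typing import Dict, List, Set


def remove_multiple_cooccurrence(
    cooccurrence: Dict[str, List[str]], multiple: Set[str]
) -> Dict[str, List[str]]:
    """Drop multiple cooccurrence phrases from each component's list, keeping order."""
    return {
        component: [p for p in phrases if p not in multiple]
        for component, phrases in cooccurrence.items()
    }


cooccurrence = {
    "July": ["disease-causing colon bacillus", "food poisoning",
             "infection", "measures", "hygiene"],
    "August": ["disease-causing colon bacillus", "infection",
               "food poisoning", "measures", "group"],
}
multiple = {"disease-causing colon bacillus", "food poisoning", "infection", "group"}

component_words = remove_multiple_cooccurrence(cooccurrence, multiple)
print(component_words)
```

What remains for each component is the list that would be shown as the topic of that component, while the removed set would be shown as the similar topics.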
[0037] As can be seen from Fig. 6, when presenting 
the results of the operation above to the user, the mul- 
tiple cooccurrence words and phrases may be displayed 
as similar topics of the components of the analysis axis 
and the component-cooccurrence words and phrases 
are displayed as topics of the respective components. 
In Fig. 6, the value indicated as a degree of importance 
is a degree of depth or strength of a relationship repre- 
sented by the number of related components. It may al- 
so be possible to relax the restriction of cooccurrence 
words and phrases to be obtained as multiple cooccur- 
rence words and phrases such that cooccurrence words 
and phrases with a lower degree of importance are ex- 
tracted as multiple cooccurrence words and phrases. 
As a result, cooccurrence words and phrases unique to
the respective components are obtained as the compo- 
nent-cooccurrence words and phrases. Therefore, it is 
possible to present topics unique to the respective com- 
ponents to the user. Moreover, the system may be con- 
figured such that the user can make a selection on a 
screen to display either one of or both of the multiple 
cooccurrence words and phrases and the component- 
cooccurrence words and phrases as results of the op- 
eration. It is also possible that the user can specify on 
a screen a threshold value of the degree of importance 
for the cooccurrence words and phrases to be extracted 
as the multiple cooccurrence words and phrases. 
[0038] Description has been given in detail of the con- 
tents of processing executed by the embodiment. In the 
method of the embodiment described above, the cooc- 
currence words and phrases are compared between the 
components of the analysis axis to obtain, as multiple 
cooccurrence words and phrases, cooccurrence words 
and phrases related with many components. Thereafter, 
component-cooccurrence words and phrases are gen-
erated by removing the multiple cooccurrence words
and phrases from the cooccurrence words and phrases
of the respective components. Therefore, similar points
of the respective components of the analysis axis can
be presented as the multiple cooccurrence words and
phrases to the user, and distinguishing features thereof
can be presented as the component-cooccurrence
words and phrases to the user. That is, there is imple-
mented a text mining function which can clearly present
the analysis results to the user as above. Consequently,
it is possible to provide a document processing system
in which the user can appropriately analyze a tendency
of a set of documents.

[0039] In the description of the embodiment, a full text
search is used to selectively create a set of documents.
However, similar processing is possible in a case in
which the overall set of documents stored in the data-
base is specified as the objective document set or in
which a text or a document is used as a search condition
to set a result of the search as the objective document
set.

[0040] In the description of the example of the embod-
iment, specified bibliographical information is set as the
analysis axis for the text mining operation. However,
similar processing is possible also when specified words
and phrases are set as components of the analysis axis
for the text mining operation. In this situation, character-
istic words and phrases extracted from the objective
document set are presented to the user. The user se-
lects components from the presented words and phras-
es or inputs particular words and phrases from the key-
board.

[0041] The specification and drawings are, according-
ly, to be regarded in an illustrative rather than a restric-
tive sense. It will, however, be evident that various mod-
ifications and changes may be made thereto without de-
parting from the broader spirit and scope of the invention
as set forth in the claims.



Claims

1. A data display method, comprising the steps of:

calculating a number of appearances of a plu-
rality of words and phrases in a document of a
set of documents, the words and phrases sat-
isfying each analysis condition (Figs. 5 and 6);
and

specifically displaying the words and phrases
according to a result of the calculation (Fig. 6).

2. A data display method according to claim 1, wherein
said each analysis condition is information related
to the document (Figs. 4 to 6).

3. A data display method according to claim 1, wherein
said words and phrases display step includes dis-
playing the words and phrases for each said anal-
ysis condition (Figs. 10 to 12).

4. A text mining method, comprising: 

a characteristic words and phrases extraction
step (steps 300, 301, 800, and 801) of select-
ing, from a set of documents, all of or part of
the documents as an objective document set
and extracting, from the objective document
set, words and phrases characteristically ap-
pearing in the objective document set;
a mining scheme creation step (steps 302 and
802) of setting a mining scheme including spec-
ified components;

a related words and phrases extraction step
(steps 304 and 804) of acquiring, from the
words and phrases extracted by said charac-
teristic words and phrases extraction step, re-
lated words and phrases strongly related to the
respective specified components included in
the mining scheme; and
a multiple related words and phrases extraction
step (steps 701 and 806) of comparing the re-
lated words and phrases between the respec-
tive components included in the mining scheme
and extracting, as multiple related words and
phrases, those related words and phrases re-
lated to many components included in the min-
ing scheme.


5. A text mining method according to claim 4, wherein: 

related words and phrases strongly related to
the respective specified components included
in the mining scheme are cooccurrence words
and phrases cooccurring in a predetermined
range with the respective specified compo-
nents included in the mining scheme (Fig. 10);
said related words and phrases acquisition step
includes a cooccurrence words and phrases
acquisition step (steps 304 and 804) of obtain-
ing, from the words and phrases extracted by
said characteristic words and phrases extrac-
tion step, cooccurrence words and phrases
cooccurring in a predetermined range with the
respective specified components included in
the mining scheme; and
said multiple related words and phrases extrac-
tion step includes a multiple cooccurrence
words and phrases extraction step (step 901)
of comparing the cooccurrence words and
phrases between the respective components
included in the mining scheme and of extract-
ing, as multiple cooccurrence words and phras-
es, cooccurrence words and phrases related to
many components included in the mining
scheme.



6. A text mining method according to claim 4, wherein 
the objective document set is a document set ob- 
tained by conducting a retrieval by using a word/ 
phrase, a statement, or a document as a retrieval 
condition (Fig. 10). 

7. A text mining method according to claim 4, wherein 
said multiple related words and phrases extraction 
step includes: 

a multiple related words and phrases acquisi-
tion step (steps 700 and 701) of comparing the
related words and phrases between the re- 
spective components included in the mining 
scheme and of extracting, as multiple related 
words and phrases, related words and phrases 
related to many components included in the 
mining scheme; and 

a multiple related words and phrases removal 
step (steps 902 and 903) of removing the mul- 
tiple related words and phrases from the related 
words and phrases of the respective compo- 
nents included in the mining scheme, thereby 
creating component-related words and phras- 
es. 

8. A text mining method according to claim 5, wherein 
said multiple related words and phrases extraction 
step includes: 

a multiple cooccurrence words and phrases ac-
quisition step (steps 900 and 901) of comparing
the cooccurrence words and phrases between 
the respective components included in the min- 
ing scheme and of extracting, as multiple cooc- 
currence words and phrases, cooccurrence 
words and phrases related to many compo- 
nents included in the mining scheme; and 
a multiple related words and phrases removal 
step (steps 902 and 903) of removing the mul- 
tiple related words and phrases from the related 
words and phrases of the respective compo- 
nents included in the mining scheme, thereby 
creating component-related words and phras- 
es. 

9. A text mining method according to claim 8, wherein 
the cooccurrence words and phrases related to 
many components included in the mining scheme 
are words and phrases extracted as cooccurrence 
words and phrases for at least a predetermined 
number of components (Figs. 10 to 12). 

10. A text mining method according to claim 8, wherein
the cooccurrence words and phrases related to
many components included in the mining scheme
are words and phrases extracted as cooccurrence
words and phrases having a value exceeding a pre-
determined value, the value being calculated ac-
cording to strength of cooccurrence of the cooccur-
rence words and phrases with each of the compo-
nents included in the mining scheme and the
number of the cooccurrence components (Fig. 10).

11. A text mining method according to claim 9, wherein 
said multiple cooccurrence words and phrases ac- 
quisition step includes an importance calculation 
step for calculating importance of the multiple cooc- 
currence words and phrases according to a prede- 
termined calculation formula (step 900). 

12. A text mining method according to claim 10, where- 
in said multiple cooccurrence words and phrases 
acquisition step includes an importance calculation 
step for calculating importance of the multiple cooc- 
currence words and phrases according to a prede- 
termined calculation formula (step 900). 

13. A text mining method according to claim 11, wherein
the importance is calculated according to a prede- 
termined calculation formula using the number of 
the components associated with the multiple cooc- 
currence words and phrases (step 900, Figs. 11 and 
12). 

14. A text mining method according to claim 12, where- 
in the importance is calculated according to a pre- 
determined calculation formula using strength of 
cooccurrence of the multiple cooccurrence words 
and phrases with each of the components included 
in the mining scheme and the number of the cooc- 
currence components (step 900, Figs. 11 and 12). 

15. A text mining method according to claim 7, further 
comprising the step of a related words and phrases 
indication step of indicating the multiple related 
words and phrases obtained by said multiple relat- 
ed words and phrases acquisition step and the com- 
ponent-related related words and phrases obtained 
by the multiple related words and phrases removal 
step (Figs. 10 to 12). 

16. A text mining method according to claim 8, further 
comprising the step of a cooccurrence words and 
phrases indication step of indicating the multiple 
cooccurrence words and phrases obtained by said 
multiple cooccurrence words and phrases acquisi- 
tion step and the component-cooccurrence related 
words and phrases obtained by the multiple cooc- 
currence words and phrases removal step (Figs. 10 
to 12). 

17. A text mining method according to claim 11, further
comprising the step of a cooccurrence words and
phrases indication step of indicating the multiple
cooccurrence words and phrases and the impor-
tance obtained by said multiple cooccurrence
words and phrases acquisition step and the com-
ponent-cooccurrence related words and phrases
obtained by the multiple cooccurrence words and
phrases removal step (Figs. 10 to 12).

18. A text mining apparatus, comprising:

characteristic words and phrases extraction
means (112) for selecting, from a set of docu-
ments, all of or part of the documents as an ob-
jective document set and for extracting, from
the objective document set, words and phrases
characteristically appearing in the objective
document set;

mining scheme creation means (113) for setting
a mining scheme including specified compo-
nents;

related words and phrases extraction means
(114) for obtaining, from the words and phrases
extracted by said characteristic words and
phrases extraction means, related words and
phrases strongly related to the respective spec-
ified components included in the mining
scheme; and

multiple related words and phrases extraction
means (116, 117) for comparing the related
words and phrases between the respective
components included in the mining scheme
and for extracting, as multiple related words and
phrases, related words and phrases related to
many components included in the mining
scheme.

19. A storing medium (105, 106) having stored thereon
a program (109-118) to configure thereon a text
mining system, wherein the text mining system
comprises:

a characteristic words and phrases extraction
module (112) for selecting, from a set of docu-
ments, all of or part of the documents as an ob-
jective document set and extracting, from the
objective document set, words and phrases
characteristically appearing in the objective
document set;

a mining scheme creation module (113) for set-
ting a mining scheme including specified com-
ponents;

a related words and phrases extraction module
(114) for obtaining, from the words and phrases
extracted by said characteristic words and
phrases extraction module, related words and
phrases strongly related to the respective spec-
ified components included in the mining
scheme; and

a multiple related words and phrases extraction
module (115-117) for comparing the related
words and phrases between the respective
components included in the mining scheme
and extracting, as multiple related words and
phrases, related words and phrases related to
many components included in the mining
scheme.

20. A computer-executable program (109-118) for im- 
plementing a text mining method using a computer, 
wherein said text mining method comprises the
steps of: 

selecting, from a set of documents, all of or part 
of the documents as an objective document set 
and extracting, from the objective document 
set, words and phrases characteristically ap- 
pearing in the objective document set; 
setting a mining scheme including specified 
components; 

acquiring from the words and phrases extract- 
ed by said characteristic words and phrases ex- 
traction step, related words and phrases 
strongly related to the respective specified 
components included in the mining scheme; 
and 

comparing the related words and phrases be- 
tween the respective components included in 
the mining scheme and extracting, as multiple 
related words and phrases, related words and 
phrases related to many components included 
in the mining scheme. 

21. A text-mining oriented data structure including mul-
tiple related words and phrases generated from a
document set, said multiple related words and
phrases being determined by those related to more
than a designated number of components included
in a mining scheme, said data structure being cre-
ated by implementing the steps of:

selecting, from a set of documents, all of or part
of the documents as an objective document set
and extracting, from the objective document
set, words and phrases characteristically ap-
pearing in the objective document set;
setting a mining scheme including specified
components;

acquiring from the words and phrases extract-
ed by said characteristic words and phrases ex-
traction step, related words and phrases
strongly related to the respective specified
components included in the mining scheme;
and

comparing the related words and phrases be-
tween the respective components included in
the mining scheme to generate a set of multiple
related words and phrases.

FIG. 3